Add 5xx retry policy for transient server errors #26821
Merged
simorenoh merged 7 commits on Jun 1, 2026
Adds a retry policy for 500, 502, and 504 responses in the azcosmos client retry pipeline. Only read operations are retried, which is consistent with the .NET, Java, and Python Cosmos SDKs. The retry budget is one in-region retry followed by one cross-region retry. The cross-region retry only fires when cross-region retries are enabled and a preferred location is available to fail over to. Fixes Azure#25639. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
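The retry budget described above can be sketched as a small decision function. This is only an illustration under stated assumptions: `retryAction`, `decide`, and the `canFailOver` flag are hypothetical names invented for the sketch, not identifiers from `azcosmos`, and the real policy is wired into the client pipeline rather than called directly.

```go
package main

import (
	"fmt"
	"net/http"
)

// retryAction is a hypothetical enum used only in this sketch.
type retryAction int

const (
	noRetry retryAction = iota
	retryInRegion
	retryCrossRegion
)

// isTransientServerError reports whether status is 500, 502, or 504,
// the three codes the PR description lists.
func isTransientServerError(status int) bool {
	switch status {
	case http.StatusInternalServerError, http.StatusBadGateway, http.StatusGatewayTimeout:
		return true
	default:
		return false
	}
}

// decide returns the next step for a response.
// retriesDone counts the 5xx retries already spent for this request.
// canFailOver stands in for "cross-region retries are enabled and there
// is somewhere else to go" (an assumption of this sketch).
func decide(status int, isRead bool, retriesDone int, canFailOver bool) retryAction {
	// Writes and non-covered status codes are never retried here.
	if !isRead || !isTransientServerError(status) {
		return noRetry
	}
	switch {
	case retriesDone == 0:
		return retryInRegion
	case retriesDone == 1 && canFailOver:
		return retryCrossRegion
	default:
		return noRetry
	}
}

func main() {
	// Walk one read request through the budget: in-region, cross-region, stop.
	for retries := 0; retries < 3; retries++ {
		fmt.Println(retries, decide(http.StatusBadGateway, true, retries, true))
	}
}
```

Keeping the decision as a pure function of the status, operation type, and a counter makes the budget easy to test in isolation, which is roughly what a dedicated counter such as `serverErrorRetryCount` enables.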
Pull request overview
Adds transient 5xx retry handling to the Cosmos client retry policy so read requests can recover from certain server-side failures while ensuring write requests are not retried.
Changes:
- Added retry handling for HTTP `500`, `502`, and `504` in `clientRetryPolicy`, with an in-region retry followed by an optional cross-region retry.
- Introduced `serverErrorRetryCount` to track the 5xx retry budget independently of other retry counters.
- Added unit tests for read/write behavior on these 5xx responses and documented the change in the package changelog.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| sdk/data/azcosmos/cosmos_client_retry_policy.go | Implements 5xx retry logic and tracking in the client retry policy. |
| sdk/data/azcosmos/cosmos_client_retry_policy_test.go | Adds tests validating read retries and ensuring writes are not retried for 5xx. |
| sdk/data/azcosmos/CHANGELOG.md | Notes the new 5xx retry behavior in the unreleased changelog. |
Adds tests covering scenarios the original PR did not exercise:

* In-region retry succeeds without a cross-region attempt.
* Cross-region retries disabled - only the in-region retry runs.
* Exhausted retries surface `*azcore.ResponseError` with the original status code.
* 501 (Not Implemented) is not retried.
* 500 interleaved with 503 retries correctly across regions.

Together with the existing tests this brings `attemptRetryOnServerError` and `shouldRetryStatus` to 100% statement coverage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses review feedback on the 5xx retry policy. Previously the cross-region retry was gated only on `enableCrossRegionRetries` and the length of `preferredLocations`, so single-region accounts (or configurations where the only resolved read endpoint matches the current endpoint) would burn the cross-region retry on the same URL without actually failing over. The cross-region retry now also requires more than one resolved read endpoint in the location cache. The mock location-cache helper used in tests is updated to populate `readEndpoints` / `writeEndpoints` per available location so existing cross-region retry tests continue to exercise the failover path, and a new test pins the single-endpoint behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
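The tightened guard might look roughly like the following. This is a sketch, not the code from the PR: `canRetryCrossRegion` is a hypothetical helper, the inputs are reduced to plain counts, and treating "the length of `preferredLocations`" as a non-empty check is an assumption.

```go
package main

import "fmt"

// canRetryCrossRegion is a hypothetical helper for this sketch.
// preferredLocations and resolvedReadEndpoints are counts, standing in
// for the slice lengths the real policy would inspect.
func canRetryCrossRegion(enableCrossRegionRetries bool, preferredLocations, resolvedReadEndpoints int) bool {
	// Original gate: feature flag plus at least one preferred location
	// (the exact threshold is an assumption).
	if !enableCrossRegionRetries || preferredLocations == 0 {
		return false
	}
	// Added gate: with a single resolved read endpoint, a "cross-region"
	// retry would hit the same URL again, so skip it.
	return resolvedReadEndpoints > 1
}

func main() {
	fmt.Println("single-region account:", canRetryCrossRegion(true, 1, 1))
	fmt.Println("multi-region account: ", canRetryCrossRegion(true, 2, 2))
}
```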
tvaron3
approved these changes
May 29, 2026
…icts Re-applied the 5xx (500/502/504) read-retry additions on top of upstream's refactored clientRetryPolicy (cross-region connection-error failover + 408 handling): serverErrorRetryCount field, maxServerErrorRetryCount, the 5xx switch case with advanceLocation gating, shouldRetryStatus entries, and attemptRetryOnServerError. Added locationCache.readEndpointCount() so the cross-region 5xx guard reads readEndpoints under RLock. Re-added the server-error tests, injecting multiple read endpoints inline (via multiReadEndpointLC) instead of mutating the shared CreateMockLC. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
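Reading a slice length under a read lock, as `readEndpointCount()` is described as doing, can be sketched like this. The `endpointCache` type, its field layout, and `setReadEndpoints` are hypothetical stand-ins; the real `locationCache` holds more state and may be structured differently.

```go
package main

import (
	"fmt"
	"sync"
)

// endpointCache is a hypothetical, stripped-down stand-in for a location cache.
type endpointCache struct {
	mu            sync.RWMutex
	readEndpoints []string
}

// setReadEndpoints replaces the endpoint list under the write lock.
// The slice is copied so callers cannot mutate the cached data later.
func (c *endpointCache) setReadEndpoints(endpoints []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.readEndpoints = append([]string(nil), endpoints...)
}

// readEndpointCount returns the number of read endpoints under the read
// lock, so a concurrent refresh cannot race with the length check.
func (c *endpointCache) readEndpointCount() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.readEndpoints)
}

func main() {
	cache := &endpointCache{}

	// Simulate a background refresh updating the cache.
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		cache.setReadEndpoints([]string{"https://east.example", "https://west.example"})
	}()
	wg.Wait()

	fmt.Println("read endpoints:", cache.readEndpointCount())
}
```

Exposing only the count keeps the lock inside the cache type, so the retry policy never touches the slice directly.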
simorenoh
reviewed
Jun 1, 2026
tvaron3
reviewed
Jun 1, 2026
simorenoh
approved these changes
Jun 1, 2026
simorenoh
reviewed
Jun 1, 2026
tvaron3
reviewed
Jun 1, 2026
Co-authored-by: Simon Moreno <30335873+simorenoh@users.noreply.github.com>
Address PR review feedback:

- Add `TestReadServerError_CrossRegionRoutesToDifferentEndpoint`, which backs the in-region and next-preferred read endpoints with two distinct mock servers and a host-routing transport. It asserts the first two attempts hit the in-region endpoint and the cross-region retry hits the second, distinct endpoint, proving that failover changes the request target, not just retry counters.
- Update the 5xx CHANGELOG entry to link the PR instead of the issue.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
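A host-routing transport of the kind this test relies on could be sketched with the standard library alone. Everything named here (`hostRouter`, `newStatusServer`, `runFailover`, and the `east.example` / `west.example` hosts) is hypothetical, and the three attempts are driven by hand rather than by the actual retry policy; the point is only to show how recording the target host proves which endpoint each attempt reached.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// hostRouter is a hypothetical RoundTripper that maps a logical host to a
// mock server and records which logical host each attempt targeted.
// It is not safe for concurrent use; the sketch sends requests sequentially.
type hostRouter struct {
	routes map[string]*url.URL
	hits   []string
}

func (h *hostRouter) RoundTrip(req *http.Request) (*http.Response, error) {
	target, ok := h.routes[req.URL.Host]
	if !ok {
		return nil, fmt.Errorf("no route for host %q", req.URL.Host)
	}
	h.hits = append(h.hits, req.URL.Host)
	// Clone before rewriting: a RoundTripper must not modify the caller's request.
	out := req.Clone(req.Context())
	out.URL.Scheme = target.Scheme
	out.URL.Host = target.Host
	out.Host = target.Host
	return http.DefaultTransport.RoundTrip(out)
}

// newStatusServer starts a mock server that always answers with status.
func newStatusServer(status int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	}))
}

// runFailover sends two attempts to the in-region host and one to the
// second host, returning the hosts hit and the statuses observed.
func runFailover() ([]string, []int, error) {
	inRegion := newStatusServer(http.StatusInternalServerError)
	defer inRegion.Close()
	other := newStatusServer(http.StatusOK)
	defer other.Close()

	inURL, err := url.Parse(inRegion.URL)
	if err != nil {
		return nil, nil, err
	}
	otherURL, err := url.Parse(other.URL)
	if err != nil {
		return nil, nil, err
	}

	router := &hostRouter{routes: map[string]*url.URL{
		"east.example": inURL,
		"west.example": otherURL,
	}}
	client := &http.Client{Transport: router}

	var statuses []int
	for _, host := range []string{"east.example", "east.example", "west.example"} {
		resp, err := client.Get("http://" + host + "/dbs")
		if err != nil {
			return nil, nil, err
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		statuses = append(statuses, resp.StatusCode)
	}
	return router.hits, statuses, nil
}

func main() {
	hosts, statuses, err := runFailover()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(hosts, statuses)
}
```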
The conflict-resolution merge had reverted this workflow file to the PR branch's older unquoted GITHUB_ENV form to avoid a workflow-scope push restriction, which left the PR showing an unintended diff. Restore it to upstream/main verbatim so the PR no longer modifies the workflow file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
10c56ee to e20bea0
simorenoh
reviewed
Jun 1, 2026
In #26820 we changed the list of status codes we pass in to exclude 429s, so that we own all of that logic on our end and bypass Core's retry logic. Do we need to do the same for 500/502/504 in this one? EDIT: Never mind, I just realized we wrap the error as non-retryable, so this should be fine.
simorenoh
approved these changes
Jun 1, 2026
Description

Adds a retry policy in the `clientRetryPolicy` for `500`, `502`, and `504` responses. Only read operations are retried; writes are surfaced to the caller immediately.

The retry budget is one in-region retry followed by one cross-region retry. The cross-region retry only fires when cross-region retries are enabled and a preferred location is available to fail over to. After both attempts are exhausted (or skipped), the original `5xx` response is returned wrapped as a non-retriable error, matching the pattern already used for `503`/`404`/`403` in this file.

Fixes #25639.
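The idea of wrapping the final response as a non-retriable error, so that an outer retry layer leaves it alone, can be illustrated with a stdlib-only marker type. This sketch does not use the SDK's own helper; `statusError`, `nonRetriable`, `markNonRetriable`, `isNonRetriable`, and `statusCodeOf` are all hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// statusError is a hypothetical error carrying the original status code.
type statusError struct{ StatusCode int }

func (e *statusError) Error() string {
	return fmt.Sprintf("server returned status %d", e.StatusCode)
}

// nonRetriable wraps an error and tags it with a marker method that an
// outer retry layer can look for.
type nonRetriable struct{ error }

func (nonRetriable) NonRetriable()   {}
func (n nonRetriable) Unwrap() error { return n.error }

// markNonRetriable wraps err so outer layers stop retrying it.
func markNonRetriable(err error) error { return nonRetriable{err} }

// isNonRetriable reports whether any error in the chain carries the marker.
func isNonRetriable(err error) bool {
	var marker interface{ NonRetriable() }
	return errors.As(err, &marker)
}

// statusCodeOf recovers the original status code through the wrapper.
func statusCodeOf(err error) (int, bool) {
	var se *statusError
	if errors.As(err, &se) {
		return se.StatusCode, true
	}
	return 0, false
}

func main() {
	err := markNonRetriable(&statusError{StatusCode: 502})
	code, _ := statusCodeOf(err)
	fmt.Println(isNonRetriable(err), code, err)
}
```

Because the wrapper implements `Unwrap`, the caller can still inspect the original status code while the outer layer only needs to check for the marker.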